Computational
Literary Studies

A gentle introduction

Maciej Eder (maciej.eder@ut.ee)

28.04.2025


CUDAN Open Lab seminar

introduction

First, what CLS is about

  • Computational Literary Studies
  • Aimed at analyzing (large amounts of) textual data…
  • … by computational techniques

Leon Battista Alberti

Leon Battista Alberti, De componendis cifris, ca. 1466

Computation into criticism

John Burrows, Computation into Criticism, 1987

Distant reading

Franco Moretti, Matt Jockers, Ted Underwood

Sociology of reading

Karina van Dalen-Oskam, Het raadsel literatuur, 2021

Quantitative linguistics

Foundations of CLS

  • Computation into criticism
  • Distant reading
  • Stylometry
  • Authorship attribution
  • Digital humanities
  • Language resources
  • Digital libraries
  • Natural language processing
  • Machine learning
  • Big data

What CLS has to offer

  • Scientific method
    • reproducibility, empirical paradigm, statistical modeling, probabilistic inference, …
  • Scale
    • access to unprecedented amounts of data
  • Accuracy
    • ability to capture patterns invisible to the naked eye

1,000 Polish novels

Combination of factors needed

  • Datasets (language resources)
  • Tools (computer programs)
  • Suitable methodology
  • Computer power (i.e. scientific instruments)

Not possible individually

Research infrastructures

Libraries, journals, publishers, …

Dictionaries at IJP PAN

ELTeC corpus

DraCor

CLS INFRA

An infrastructural project for computational literary studies, funded under the Horizon 2020 scheme

infrastructures in DH and CLS

  • in hard sciences, infrastructures are tangible
    • servers, telescopes, accelerators, …
  • in the humanities, institutions are essential
    • libraries, publishing houses, journals, …
  • in DH, multifaceted needs
    • the notion of infrastructure needs reconsideration
    • corpora (FAIR!) but not only

CLS INFRA project

  • text collections (corpora)
    • quality
    • metadata
    • conversion
  • methodology
    • tools (NLP, datavis, …)
    • tool chains
    • methodological considerations
    • bibliographic survey
  • network of scholars
    • training schools
    • short-term research stays
    • collaboration with COST Action

The overarching idea is to connect…

  • People
    • To establish a network of CLS researchers
  • Data
    • To consolidate existing high-quality corpora…
    • …covering prose, drama and poetry
  • Tools
    • To build a chain of NLP tools to analyze texts
  • Methods
    • To provide a survey of state-of-the-art methods

outcomes

activities

  • training schools
    • Prague 2022, Madrid 2023, Vienna 2024
  • workshops
    • DH 2024, DH 2025
  • closing event
    • at CCLS 2025
  • transnational access fellowships
    • short-term research stays…
    • in one of 6 institutions:

selected deliverables

  • 3.1 Report on the methodological baseline for (computational) literary studies
  • 4.1 Report on the skills matrix for computational literary studies
  • 5.1 Review of the data landscape
  • 6.1 Assembly of existing data

survey of methods

CLS-centric Discord server

Text analysis

Why text analysis?

  • Authorship attribution
  • Forensic linguistics
  • Register analysis
  • Genre recognition
  • Gender differences
  • Translatorial signal
  • Early vs. mature style
  • Style evolution
  • Detecting dementia

stylometry

  • measures stylistic differences between texts
  • oftentimes aimed at authorship attribution
  • relies on stylistic fingerprint, …
  • … aka measurable linguistic features
    • frequencies of function words
    • frequencies of grammatical patterns, etc.
  • proves successful in several applications

How to compare texts?

  • Extracting valuable (i.e. countable) language features from texts
    • frequencies of words 👈
    • frequencies of syllables
    • versification patterns
    • grammatical patterns
    • distribution of topics
  • Comparing these features by means of multivariate analysis
    • distance-based methods 👈
    • neural networks

From words to features

‘It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.’
(J. Austen, Pride and Prejudice)

“the” = 4.25%

“in” = 3.45%

“of” = 1.81%

“to” = 1.44%

“a” = 1.37%

“was” = 1.17%

. . .
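The step from raw text to a feature vector can be sketched in a few lines of Python (a minimal illustration; the regex tokenizer here is a hypothetical stand-in for the one used by stylo):

```python
import re
from collections import Counter

def word_frequencies(text):
    """Tokenize a text into lowercase words and return relative frequencies."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

sentence = ("It is a truth universally acknowledged, that a single man "
            "in possession of a good fortune, must be in want of a wife.")
freqs = word_frequencies(sentence)
# 'a' occurs 4 times among the 23 tokens of this single sentence;
# the percentages on the slide come from the whole novel
```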

From features to similarities

##                              the   and    to     a    of    he     i    in
## Capote_Blood_1966          5.226 2.828 2.350 2.909 2.323 1.907 1.562 1.346
## Capote_Breakfast_1958      4.178 2.489 2.230 3.117 1.829 1.277 3.059 1.273
## Capote_Crossing_1950       4.079 2.848 2.119 2.998 2.134 1.528 1.061 1.428
## Capote_Harp_1951           4.597 2.462 2.326 3.030 1.950 1.103 2.253 1.390
## Capote_Voices_1948         4.661 3.201 1.808 3.146 1.727 1.868 1.237 1.481
## Faulkner_Absalom_1936      5.567 4.802 2.649 1.463 2.145 2.219 0.835 1.374
## Faulkner_Dying_1930        5.268 3.487 2.321 1.883 1.403 2.152 3.102 1.209
## Faulkner_Light_1832        5.611 3.660 2.350 1.940 1.611 3.482 1.325 1.365
## Faulkner_Moses_1942        6.077 4.910 2.275 1.526 1.819 2.685 0.836 1.366
## Faulkner_Sound_1929        4.391 3.046 2.631 1.812 1.259 1.427 3.902 1.167
## Glasgow_Phases_1898        6.065 3.357 1.983 2.625 3.071 2.203 1.639 1.632
## Glasgow_Vein_1935          5.153 2.708 2.389 2.249 2.051 1.563 1.651 1.697
## Glasgow_Virginia_1913      5.447 2.459 2.691 2.045 3.217 1.434 1.582 1.656
## HarperLee_Mockingbird_1960 3.936 2.336 2.451 1.855 1.528 1.934 2.559 1.369
## HarperLee_Watchman_2015    4.283 2.566 2.471 1.921 1.781 1.291 1.730 1.529
## McCullers_HeartIsaLo_1940  6.162 4.020 2.441 2.362 1.905 2.571 1.042 1.681
## McCullers_Member_1946      6.472 4.417 2.210 2.445 1.743 0.816 1.245 1.523
## McCullers_Reflection_1941  7.192 3.382 2.125 2.888 2.399 2.622 0.397 1.902
## OConnor_Everything_1956    5.597 3.198 2.578 2.461 1.965 2.939 0.948 1.700

What we hope to get

stylometry beyond attribution

boosting frequencies

areas of improvement

  • classification method
    • distance-based
    • svm, nsc, knn, …
    • neural networks
  • feature engineering
    • dimension reduction
    • lasso
  • feature choice
    • MFWs
    • POS n-grams
    • character n-grams

relative frequencies

simple normalization

Occurrences of the most frequent words (MFWs):

## 
##  the  and   to    i   of    a   in  was  her   it  you   he  she that  not   my 
## 4571 4748 3536 4130 2224 2326 1484 1127 1551 1391 1895 2138 1338 1250  937 1106

Relative frequencies:

## 
##    the    and     to      i     of      a     in    was    her     it    you 
## 0.0383 0.0398 0.0296 0.0346 0.0186 0.0195 0.0124 0.0094 0.0130 0.0116 0.0159

relative frequencies

The number of occurrences of a given word divided by the total number of words:

\[ f_\mathrm{the} = \frac{n_\mathrm{the}}{ n_\mathrm{the} + n_\mathrm{of} + n_\mathrm{and} + n_\mathrm{in} + ... } \]

In a generalized version:

\[ f_{w} = \frac{n_{w}}{N} \]

relative frequencies

  • routinely used
  • reliable
  • simple
  • intuitive
  • conceptually elegant

words that matter

synonyms

Proportions within synonym groups might betray a stylistic signal:

  • on and upon
  • drink and beverage
  • buy and purchase
  • big and large
  • et and atque and ac

proportions within synonyms

The proportion of on to upon:

\[ f_\mathrm{on} = \frac{n_\mathrm{on}}{ n_\mathrm{on} + n_\mathrm{upon} } \]

The proportion of upon to on:

\[ f_\mathrm{upon} = \frac{n_\mathrm{upon}}{ n_\mathrm{on} + n_\mathrm{upon} } \]

Naturally, the two proportions sum to 1.
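The proportion within a synonym group is simple to compute; a minimal sketch in Python, with hypothetical counts:

```python
def synonym_proportion(counts, target, group):
    """Share of `target` within the total count of its synonym group."""
    total = sum(counts.get(w, 0) for w in group)
    return counts.get(target, 0) / total if total else 0.0

# hypothetical counts of 'on' and 'upon' in a single novel
counts = {"on": 180, "upon": 20}
p_on = synonym_proportion(counts, "on", ["on", "upon"])      # 0.9
p_upon = synonym_proportion(counts, "upon", ["on", "upon"])  # 0.1
```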

‘on’/total vs. ‘on’/(‘upon’ + ‘on’)

‘the’/total vs. ‘the’/(‘of’ + ‘the’)

limitations of synonyms

  • in many cases, several synonyms
    • cf. et and atque and ac in Latin
  • in many cases, no synonyms at all
  • target words might belong to different grammatical categories
  • what are the synonyms for function words?
  • provisional conclusion:
    • synonyms are but a subset of the words that matter

beyond synonyms

semantic similarity

  • target words: synonyms and more
  • e.g. for the word make the target words can involve:
    • perform, do, accomplish, finish, reach, produce, …
    • all their inflected forms (if applicable)
    • derivative words: nouns, adjectives, e.g. a deed
  • the size of a target semantic area is unknown

word vector models

  • trained on a large amount of textual data
  • capable of capturing (fuzzy) semantic relations between words
  • many implementations:
    • word2vec
    • GloVe
    • fastText
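Under the hood, the "neighbors" of a word are simply the words whose vectors have the highest cosine similarity to its vector. A toy sketch with hand-made 3-dimensional vectors (a real model such as GloVe uses hundreds of dimensions):

```python
import numpy as np

def nearest_neighbors(word, vectors, k=3):
    """Rank words by cosine similarity to `word` in a toy embedding space."""
    v = vectors[word]
    sims = {}
    for w, u in vectors.items():
        sims[w] = float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
    return sorted(sims, key=sims.get, reverse=True)[:k]

# tiny hand-made vectors for illustration only
vectors = {
    "house": np.array([1.0, 0.9, 0.1]),
    "room":  np.array([0.9, 1.0, 0.2]),
    "buy":   np.array([0.1, 0.2, 1.0]),
}
# the word itself always ranks first, with a similarity of 1.0
print(nearest_neighbors("house", vectors, k=2))
```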

GloVe model: examples

the neighbors of house:

##  house  where  place   room   town houses   farm  rooms    the   left 
##  1.000  0.745  0.688  0.685  0.664  0.651  0.647  0.638  0.637  0.635

the neighbors of home:

##   home return   come coming  going london   back     go   went   came 
##  1.000  0.732  0.717  0.705  0.696  0.689  0.682  0.670  0.666  0.665

the neighbors of buy:

##    buy   sell wanted   want   sold    get   send  wants   give  money 
##  1.000  0.728  0.552  0.539  0.537  0.536  0.535  0.531  0.526  0.510

the neighbors of style:

##    style  quality  fashion   manner     type    taste  manners language 
##    1.000    0.597    0.565    0.560    0.547    0.527    0.518    0.512 
##   proper  english 
##    0.504    0.504

relative frequencies revisited

for a semantic space of 2 nearest neighbors, the frequency of the word house:

\[ f_\mathrm{house} = \frac{n_\mathrm{house}}{ n_\mathrm{house} + n_\mathrm{where} + n_\mathrm{place} } \]

for a semantic space of 5 nearest neighbors, the frequency of the word house:

\[ f_\mathrm{house} = \frac{n_\mathrm{house}}{ n_\mathrm{house} + n_\mathrm{where} + n_\mathrm{place} + n_\mathrm{room} + n_\mathrm{town} + n_\mathrm{houses} } \]

for a semantic space of 7 nearest neighbors, the frequency of the word house:

\[ f_\mathrm{house} = \frac{n_\mathrm{house}}{ n_\mathrm{house} + n_\mathrm{where} + n_\mathrm{place} + n_\mathrm{room} + n_\mathrm{town} + n_\mathrm{houses} + n_\mathrm{farm} + n_\mathrm{rooms} } \]
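In code, the generalization is straightforward; a minimal sketch (the counts are hypothetical, and the neighbor lists are whatever the word vector model returns):

```python
def semantic_frequency(counts, word, neighbors):
    """Frequency of `word` relative to itself plus its semantic neighbors."""
    denom = counts.get(word, 0) + sum(counts.get(w, 0) for w in neighbors)
    return counts.get(word, 0) / denom if denom else 0.0

# hypothetical occurrence counts in one novel
counts = {"house": 50, "where": 120, "place": 70, "room": 40, "town": 30}

f2 = semantic_frequency(counts, "house", ["where", "place"])
f4 = semantic_frequency(counts, "house", ["where", "place", "room", "town"])
# widening the semantic space shrinks the relative frequency: f4 < f2
```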

will it fly?

experimental setup

  • a corpus of 99 novels in English
  • by 33 authors (3 texts per author)
  • tokenized and classified by the package stylo
    • stratified cross-validation scenario
    • 100 cross-validation folds
    • distance-based classification performed
    • F1 scores reported

distance measures used

  • classic Burrows’s Delta
  • Cosine Delta (Würzburg)
  • Eder’s Delta
  • raw Manhattan distance
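Two of these measures can be sketched compactly, assuming the textbook definitions: Burrows's Delta as the mean absolute difference of z-scored word frequencies, and Cosine Delta as the cosine distance between the same z-score vectors (Eder's Delta, roughly, additionally down-weights the rarer words):

```python
import numpy as np

def burrows_delta(a, b):
    """Mean absolute difference of z-scored word frequencies."""
    return float(np.mean(np.abs(a - b)))

def cosine_delta(a, b):
    """Cosine distance (1 - cosine similarity) between z-score vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# z-scores of a handful of MFWs for two hypothetical texts
a = np.array([0.5, -1.2, 0.3, 0.8])
b = np.array([0.4, -1.0, 0.1, 1.1])
```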

results

results for Cosine Delta

results for Cosine Delta

results for Burrows’s Delta

results for Eder’s Delta

results for Manhattan Distance

the best F1 scores

  • Cosine Delta: 0.96
  • Burrows’s Delta: 0.84
  • Eder’s Delta: 0.83
  • raw Manhattan: 0.77

how good are the results?

  • we know that Cosine Delta outperforms Classic Delta etc.
  • what is the actual gain in performance, then?
  • an additional round of tests performed to get baseline
  • the gain above the baseline reported below

gain for Cosine Delta

gain for Burrows’s Delta

gain for Eder’s Delta

gain for Manhattan Distance

conclusions

  • in each scenario, the gain was considerable
  • the hot spot of performance varied depending on the method…
  • … yet it was spread between 5 and 100 semantic neighbors
  • the best classifiers are even better: up to 12% improvement!

conclusions (cont.)

  • the new method is very simple
  • it doesn’t require any NLP tooling…
  • … except getting a general list of n semantic neighbors for MFWs
  • such a list can be generated once and re-used several times
  • if a rough method of tracing the words that matter was already successful, a bigger gain can be expected with sophisticated language models

Thank you!

mail:

twitter: @MaciejEder

GitHub: https://github.com/computationalstylistics/word_frequencies

appendix

alt semantic space

  • perhaps the n closest neighbors are not the best way to define semantic spaces
  • therefore: testing all the words within a cosine distance of x from the reference word

results for cosine similarities

gain for Cosine Delta

results for delta similarities

gain for Burrows’s Delta

results for Eder’s similarities

gain for Eder’s Delta

exotic distances

What is a distance?

take any two texts:

##                 the   and    to    of     a   was     I    in    he  said   you
## lewis_lion    5.141 3.699 2.295 2.185 2.100 1.346 0.813 1.162 1.087 1.426 1.141
## tolkien_lord1 5.624 3.782 2.074 2.597 1.916 1.313 1.492 1.419 1.221 0.825 0.872

subtract the values vertically:

##           the    and    to     of     a   was      I     in     he  said   you
##        -0.483 -0.083 0.221 -0.412 0.184 0.033 -0.679 -0.257 -0.134 0.601 0.269

then drop the minuses:

##                 the   and    to    of     a   was     I    in    he  said   you
##               0.483 0.083 0.221 0.412 0.184 0.033 0.679 0.257 0.134 0.601 0.269

sum up the obtained values:

## [1] 3.356
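The same three steps, reproduced in Python with NumPy (the frequency vectors copied from the table above):

```python
import numpy as np

lewis   = np.array([5.141, 3.699, 2.295, 2.185, 2.100, 1.346,
                    0.813, 1.162, 1.087, 1.426, 1.141])
tolkien = np.array([5.624, 3.782, 2.074, 2.597, 1.916, 1.313,
                    1.492, 1.419, 1.221, 0.825, 0.872])

diff = lewis - tolkien       # subtract the values vertically
absolute = np.abs(diff)      # drop the minuses
manhattan = absolute.sum()   # sum up the obtained values
print(round(manhattan, 3))   # 3.356
```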

Manhattan vs. Euclidean

Euclidean distance

between any two texts represented by two points A and B in an n-dimensional space can be defined as:

\[ \delta_{AB} = \sqrt{ \sum_{i = 1}^{n} (A_i - B_i)^2 } \]

where A and B are the two documents to be compared, and \(A_i,\, B_i\) are the scaled (z-scored) frequencies of the i-th word in the range of n most frequent words.

Manhattan distance

can be formalized as follows:

\[ \delta_{AB} = \sum_{i = 1}^{n} | A_i - B_i | \]

which is equivalent to

\[ \delta_{AB} = \sqrt[1]{ \sum_{i = 1}^{n} | A_i - B_i |^1 } \]

(the above weird notation will soon become useful)

They are siblings!

\[ \delta_{AB} = \sqrt[2]{ \sum_{i = 1}^{n} (A_i - B_i)^2 } \]

vs.

\[ \delta_{AB} = \sqrt[1]{ \sum_{i = 1}^{n} | A_i - B_i |^1 } \]

For that reason, Manhattan and Euclidean are named L1 and L2, respectively.

An (infinite) family of distances

  • The above observations can be further generalized
  • Both Manhattan and Euclidean belong to a family of (possible) distances:

\[ \delta_{AB} = \sqrt[p]{ \sum_{i = 1}^{n} | A_i - B_i |^p } \]

where p is both the power and the degree of the root.
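This general \(L^p\) (Minkowski) distance takes a single line to implement; a minimal sketch:

```python
import numpy as np

def minkowski(a, b, p):
    """The L^p distance: p-th root of the sum of |A_i - B_i|^p."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 5.0])
# p = 1 gives Manhattan, p = 2 gives Euclidean;
# p < 1 is a dissimilarity rather than a proper norm, but still computable
```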

The norms L1, L2, L3, …

  • The power p doesn’t need to be a natural number
  • We can easily imagine norms such as L1.01, L3.14159, L1¾, L\(\sqrt{2}\) etc.
  • Mathematically, \(p < 1\) doesn’t satisfy the formal definition of a norm…
  • … yet still, one can easily imagine a dissimilarity L0.5 or L0.0001.
  • (plus, the so-called Cosine Distance doesn’t satisfy the definition either).

To summarize…

  • The p parameter is a continuum
  • Both \(p = 1\) and \(p = 2\) (for Manhattan and Euclidean, respectively) are but two specific points in this continuous space
  • p is a method’s hyperparameter to be set or possibly tuned

research question

🧐 How do the norms from a wide range beyond L1 and L2 affect text classification?

experiment

Data

Four full-text datasets used:

  • 99 English novels by 33 authors,
  • 99 Polish novels by 33 authors,
  • 28 books by 8 American Southern authors:
    • Harper Lee, Truman Capote, William Faulkner, Ellen Glasgow, Carson McCullers, Flannery O’Connor, William Styron and Eudora Welty,
  • 26 books by 5 fantasy authors:
    • J.K. Rowling, Harlan Coben, C.S. Lewis, and J.R.R. Tolkien.

Method

  • A supervised classification experiment was designed
  • Aimed at authorship attribution
  • leave-one-out cross-validation scenario
    • 100 independent bootstrap iterations…
    • … each of them involving 50% randomly selected input features (most frequent words)
    • The procedure repeated for the ranges of 100, 200, 300, …, 1000 most frequent words.
  • The whole experiment repeated iteratively for L0.1, L0.2, …, L10.
  • The performance in each iteration evaluated using accuracy, recall, precision, and F1 scores.
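A skeletal version of this setup, assuming 1-nearest-neighbor attribution under the \(L^p\) distance (the data here are synthetic; the real experiment used word frequencies and the stylo package):

```python
import numpy as np

rng = np.random.default_rng(0)

def lp_distance(a, b, p):
    """Minkowski L^p distance between two frequency vectors."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

def bootstrap_accuracy(X, y, p, iters=10):
    """Leave-one-out 1-NN attribution; each iteration samples 50% of the features."""
    n_texts, n_feats = X.shape
    scores = []
    for _ in range(iters):
        cols = rng.choice(n_feats, size=n_feats // 2, replace=False)
        Xs = X[:, cols]
        correct = 0
        for i in range(n_texts):
            dists = [lp_distance(Xs[i], Xs[j], p) if j != i else np.inf
                     for j in range(n_texts)]
            correct += int(y[int(np.argmin(dists))] == y[i])
        scores.append(correct / n_texts)
    return float(np.mean(scores))

# synthetic 'texts': two well-separated authors, 4 texts each, 10 features
X = np.vstack([rng.normal(0.0, 0.1, (4, 10)), rng.normal(1.0, 0.1, (4, 10))])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
acc = bootstrap_accuracy(X, y, p=0.5, iters=10)
```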

results

99 English novels by 33 authors

99 Polish novels by 33 authors

28 novels by 8 Southern authors

26 novels by 5 fantasy writers

conclusions

A few observations

  • Metrics with lower \(p\) generally outperform higher-order norms.
  • Specifically, Manhattan is better than Euclidean…
  • … but values \(p < 1\) are even better.
  • Feature vectors that yield the best results (here: long vectors of most frequent words) are the most sensitive to the choice of the distance measure.

Plausible explanations

  • Small \(p\) makes it more important for two feature vectors to have fewer differing features (rather than smaller differences among many features),
  • Small \(p\) amplifies small differences (important, e.g., for low-frequency features in distinguishing between 0 difference – for two texts lacking a feature – and a small difference).

Therefore:

  • Small \(p\) norms might be one way of effectively utilizing long feature vectors.

Questions?

Sources of the texts: the ELTeC corpus, and the package stylo.

The code and the datasets: https://github.com/computationalstylistics/beyond_Manhattan

appendix

Lp distances vs. Cosine Delta

English novels:

mfw     Manhattan   Lp distance        Cosine
100     0.625       0.625 (p = 0.9)    0.666
500     0.814       0.823 (p = 0.6)    0.865
1000    0.833       0.871 (p = 0.3)    0.892

Polish novels:

mfw     Manhattan   Lp distance        Cosine
100     0.655       0.659 (p = 0.8)    0.684
500     0.760       0.769 (p = 0.6)    0.840
1000    0.751       0.835 (p = 0.1)    0.842

99 English novels by 33 authors

99 Polish novels by 33 authors